7. Data Aggregation

Overview

This chapter introduces you to the concept of aggregation and its implementation in MongoDB. You will learn how to identify the parameters and structure of the aggregate command, combine and manipulate data using the primary aggregation stages, work with large datasets using advanced aggregation stages, and optimize and configure your aggregation to get the best performance out of your queries.

Introduction

In the previous chapters, we learned the fundamentals of interacting with MongoDB. With these basic operations (insert, update, and delete), we can now begin exploring and manipulating our data as we would with any other database. We also observed how, by fully leveraging the find command options, we can use operators to answer more specific questions about our data. We can also sort, limit, skip, and project on our query to create useful result sets.

In more straightforward situations, these result sets may be enough to answer your desired business question or satisfy a use case. However, more complex problems require more complex queries to answer. Solving such problems with just the find command would be highly challenging and would likely require multiple queries or some processing on the client side to organize or link the data.

A basic limitation arises when your data is contained in two separate collections. To find the correct data, you would have to run two queries instead of one, joining the data at the client or application level. This may not seem like a big problem, but as your application or dataset increases in scale, performance and complexity grow with it. Wherever possible, it is ideal for the server to do all the heavy lifting, returning only the data we are looking for in a single query. This is where the aggregation pipeline comes in.

The aggregation pipeline does precisely what the name implies. It allows you to define a series of stages that filter, merge, and organize data with much more control than the standard find command. Beyond that, the pipeline structure of aggregation allows developers and database analysts to easily, iteratively, and quickly build queries on ever-changing and growing datasets. If you want to accomplish anything significant at scale in MongoDB, you'll need to write complex, multi-stage aggregation pipelines. In this chapter, we will learn exactly how to do that.

Note

For the duration of this chapter, the exercises and activities included are iterations on a single scenario. The data and examples are based on the MongoDB Atlas sample database called sample_mflix.

Consider a scenario in which a cinema company is running its annual classic movie marathon and is trying to decide what their lineup should be. They need a variety of popular movies meeting specific criteria to satisfy their customer base. The company has asked you to research and determine the films they should show. In this chapter, we will use aggregations to retrieve data given a complex set of constraints, and then transform and manipulate data to create new results and answer business questions across our entire dataset with a single query. This will help the cinema company decide what movies they should be showing to satisfy their customers.

It's worth noting that the aggregation pipeline is robust enough that there are many ways to accomplish the same task. The exercises and activities covered in this chapter are just one solution to the scenarios posed and can be solved using different patterns. The best way to master the aggregation pipeline is to consider multiple methods to solve the same problem.

aggregate Is the New find

The aggregate command in MongoDB is similar to the find command. You can provide the criteria for your query in the form of JSON documents, and it outputs a cursor containing the search result. Sounds simple, right? That's because it is. Although aggregations can become very large and complex, at their core, they are relatively simple.

The key element in aggregation is called the pipeline. We will cover it in detail shortly, but at a high level, a pipeline is a series of instructions, where the input to each instruction is the output of the previous one. Simply put, aggregation is a method for taking a collection and, in a procedural way, filtering, transforming, and joining data from other collections to create new, meaningful datasets.
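The pipeline model can be sketched outside MongoDB entirely. The following plain-JavaScript sketch is conceptual only (the names runPipeline, match, and limit are ours, not MongoDB's): each stage is a function from an array of documents to an array of documents, and the stages run in order, top to bottom.

```javascript
// Conceptual sketch only (not MongoDB internals): each stage consumes
// the previous stage's output.
function runPipeline(docs, stages) {
    return stages.reduce(function(current, stage) {
        return stage(current);
    }, docs);
}

// Illustrative stand-ins for the $match and $limit stages.
var match = function(pred) {
    return function(docs) { return docs.filter(pred); };
};
var limit = function(n) {
    return function(docs) { return docs.slice(0, n); };
};

var docs = [{ x: 1 }, { x: 2 }, { x: 3 }, { x: 4 }];
var result = runPipeline(docs, [
    match(function(d) { return d.x > 1; }), // keep documents with x > 1
    limit(2)                                // then keep the first 2
]);
// result is [{ x: 2 }, { x: 3 }]
```

The real aggregate command works on the same principle, except the stages are declarative JSON documents executed by the server rather than client-side functions.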

Aggregate Syntax

The aggregate command operates on a collection in the same way as the other Create, Read, Update, Delete (CRUD) commands, like so:

use sample_mflix;

var pipeline = [] // The pipeline is an array of stages.

var options = {} // We will explore the options later in the chapter.

var cursor = db.movies.aggregate(pipeline, options);

There are two parameters used for aggregation. The pipeline parameter contains all the logic to find, sort, project, limit, transform, and aggregate our data. The pipeline parameter itself is passed in as an array of JSON documents. You can think of this as a series of instructions to be sent to the database, and then the resulting data after the final stage is stored in a cursor to be returned to you. Each stage in the pipeline is completed independently, one after another, until none are remaining. The input to the first stage is the collection (movies in the preceding example), and the input into each subsequent stage is the output from the previous stage.

The second parameter is the options parameter. This is optional and allows you to specify the details of the configuration, such as how the aggregation should execute or some flags that are required during debugging and building your pipelines.
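To make this concrete, here is a hedged illustration of an options document using two commonly used aggregate options: allowDiskUse, which lets memory-heavy stages write temporary data to disk, and maxTimeMS, which aborts the operation after the given number of milliseconds.

```javascript
// Hedged illustration: an options document using two real aggregate
// options. allowDiskUse lets memory-heavy stages spill temporary data
// to disk; maxTimeMS aborts the operation after the time limit expires.
var options = {
    allowDiskUse: true,
    maxTimeMS: 60000 // 60 seconds
};
```

It would be passed as the second argument: db.movies.aggregate(pipeline, options).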

The parameters in an aggregate command are fewer than those in the find command. We will cover options as the final topic of this chapter, so for now, we can simplify our command by excluding options completely, as follows:

var cursor = db.movies.aggregate(pipeline);

In the preceding example, rather than writing the pipeline directly into the command, we are saving the pipeline as a variable first. Aggregation pipelines can become very large and difficult to parse during development. It can sometimes be helpful to separate the pipeline (or even large sections of the pipeline) into separate variables for code clarity. Although recommended, this pattern is completely optional; passing the pipeline directly into the command is equivalent:

var cursor = db.movies.aggregate([])
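Because a pipeline is just a JavaScript array, splitting it into named sections is straightforward. A hypothetical sketch (the section names filterStages and shapeStages are our own, not MongoDB's):

```javascript
// Hypothetical sketch: assembling one pipeline from named sections.
var filterStages = [
    { $match: { "location.address.state": "MN" } }
];
var shapeStages = [
    { $project: { "location.address.city": 1 } },
    { $sort: { "location.address.city": 1 } },
    { $limit: 3 }
];
// Concatenation preserves order: filtering first, then shaping.
var pipeline = filterStages.concat(shapeStages);
```

The combined array can then be passed to the shell as db.theaters.aggregate(pipeline), exactly as with a pipeline written in one piece.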

It is recommended that you follow along with these examples in a code or text editor, saving your scripts and then copying and pasting them into the MongoDB shell. For example, say we create a file called aggregation.js with the following content:

var MyAggregation_A = function() {

    print("Running Aggregation Script Ch7.1");

    var pipeline = [];

      // This next line stores our result in a cursor.

    var cursor = db.movies.aggregate(pipeline);

      // This line will print the next iteration of our cursor.

    printjson(cursor.next())

};

MyAggregation_A();

Then, copying this code directly into the MongoDB shell returns the following output:

Figure 7.1: Results of the aggregation (output truncated for brevity)

We can see in this output that once the MyAggregation_A function is defined, we only need to call that function again to see the results of our aggregation (in this case, a list of movies). You can call this function again and again without having to write the entire pipeline every time.

By structuring your aggregations this way, you will not lose any of them. It also has the added benefit of letting you load all your aggregations into the shell interactively as functions. However, you can also copy and paste the entire function into the MongoDB shell if you prefer or simply enter it interactively. In this chapter, we will use a mix of both methods.

The Aggregation Pipeline

As mentioned earlier, the key element in aggregation is the pipeline, which is a series of instructions to perform on the initial collection. You can think of the data as water flowing through this pipeline, being transformed and filtered at each stage until it is finally poured out the end of the pipeline as a result.

In the following diagram, the orange blocks represent the aggregation pipeline. Each of these blocks in the pipeline is referred to as an aggregation stage:

Figure 7.2: Aggregation pipeline


Something to note about aggregations is that, although the pipeline always begins with one collection, using certain stages, we can add collections further in the pipeline. We will cover joining collections later in this chapter.

Large multi-stage pipelines may look intimidating, but if you understand the structure of the command and the individual operations that can be performed at a given stage, then you can easily break the pipeline down into smaller parts. In this first topic, we will explore the construction of an aggregation pipeline, compare a query implemented using find with one created using aggregate, and identify some basic operators.

Pipeline Syntax

The syntax of an aggregation pipeline is very simple, much like the aggregate command itself. The pipeline is an array, with each item in the array being an object:

var pipeline = [

        { . . . },

        { . . . },

        { . . . },

];

Each of the objects in the array represents a single stage in the overall pipeline, with the stages being executed in their array order (top to bottom). Each stage object takes the form of the following:

{$stage : parameters}

The stage represents the action we want to perform on the data (such as limit or sort) and the parameters can be either a single value or another object, depending on the stage.

The pipeline can be passed in two ways, either as a saved variable or directly as a command. The following example demonstrates how the pipeline can be passed as a variable:

var pipeline = [

        { $match: { "location.address.state": "MN"} },

        { $project: { "location.address.city": 1 } },

        { $sort: { "location.address.city": 1 } },

        { $limit: 3 }

     ];

Then, typing in the db.theaters.aggregate(pipeline) command in the MongoDB shell will provide the following output:

MongoDB Enterprise atlas-nb3biv-shard-0:PRIMARY> var pipeline = [

... { $match: { "location.address.state": "MN"} },

... { $project: { "location.address.city": 1 } },

... { $sort: { "location.address.city": 1 } },

... { $limit: 3 }

... ];

MongoDB Enterprise atlas-nb3biv-shard-0:PRIMARY>

MongoDB Enterprise atlas-nb3biv-shard-0:PRIMARY> db.theaters.aggregate(pipeline)

{ "_id" : ObjectId("59a47287cfa9a3a73e51e94f"), "location" : { "address" : { "city" : "Apple Valley" } } }

{ "_id" : ObjectId("59a47287cfa9a3a73e51eb8f"), "location" : { "address" : { "city" : "Baxter" } } }

{ "_id" : ObjectId("59a47286cfa9a3a73e51e833"), "location" : { "address" : { "city" : "Blaine" } } }

MongoDB Enterprise atlas-nb3biv-shard-0:PRIMARY>

Passing it directly into the command, the output will look as follows:

MongoDB Enterprise atlas-nb3biv-shard-0:PRIMARY> db.theaters.aggregate([

... ... { $match: { "location.address.state": "MN"} },

... ... { $project: { "location.address.city": 1 } },

... ... { $sort: { "location.address.city": 1 } },

... ... { $limit: 3 }

... ... ]

... );

{ "_id" : ObjectId("59a47287cfa9a3a73e51e94f"), "location" : { "address" : { "city" : "Apple Valley" } } }

{ "_id" : ObjectId("59a47287cfa9a3a73e51eb8f"), "location" : { "address" : { "city" : "Baxter" } } }

{ "_id" : ObjectId("59a47286cfa9a3a73e51e833"), "location" : { "address" : { "city" : "Blaine" } } }

MongoDB Enterprise atlas-nb3biv-shard-0:PRIMARY>

As you can see, you get the same output using either method.

Creating Aggregations

Let's begin to explore the pipeline itself. The following code, when pasted in the MongoDB shell, will help us get a list of all the theaters in the state of Minnesota (MN):

var simpleFind = function() {

    // Find command using filter, project, sort and limit.

    print("Find Result:")

    db.theaters.find(

        {"location.address.state" : "MN"},

        {"location.address.city" : 1})

    .sort({"location.address.city": 1})

    .limit(3)

    .forEach(printjson);

}

simpleFind();

This will give us the following output:

MongoDB Enterprise atlas-nb3biv-shard-0:PRIMARY> simpleFind();

Find Result:

{

        "_id" : ObjectId("59a47287cfa9a3a73e51e94f"),

        "location" : {

                "address" : {

                        "city" : "Apple Valley"

                }

        }

}

{

        "_id" : ObjectId("59a47287cfa9a3a73e51eb8f"),

        "location" : {

                "address" : {

                        "city" : "Baxter"

                }

        }

}

{

        "_id" : ObjectId("59a47286cfa9a3a73e51e7e2"),

        "location" : {

                "address" : {

                        "city" : "Blaine"

                }

        }

}

This syntax should look very familiar by now. This is quite a simple command, so let's look at the steps involved:

  1. Filter the theaters collection to get a list of all theaters in the state of MN (Minnesota).
  2. Project only the city in which the theater is located.
  3. Sort the list by city name.
  4. Limit the result to the first three theaters.

Let's rebuild this command as an aggregation. Don't worry if this looks a little intimidating at first. We'll walk through it step by step:

var simpleFindAsAggregate = function() {

    // Aggregation using match, project, sort and limit.

    print ("Aggregation Result:")

    var pipeline = [

        { $match: { "location.address.state": "MN"} },

        { $project: { "location.address.city": 1 } },

        { $sort: { "location.address.city": 1 } },

        { $limit: 3 }

    ];

    db.theaters.aggregate(pipeline).forEach(printjson);

};

simpleFindAsAggregate();

You should see the following output:

Figure 7.3: Results of the aggregation (output truncated for brevity)


If you run these two functions, you will get the same results. Remember, both find and aggregate commands return a cursor, but we're using .forEach(printjson); at the end to print them out to the console for ease of understanding.

If you observe the preceding example, you should be able to match up much of the same functionality from find. project, sort, and limit are all there as JSON documents just like in the find command. The only noticeable difference with these is that they are now documents in an array instead of functions. The $match stage at the very beginning of our pipeline is the equivalent of our filter document. So, let's break it down step by step:

  1. First, search the theaters collection to locate documents that match the state MN:

    { $match: { "location.address.state": "MN"} },

  2. Pass this list of theaters to the second stage, which projects only the city the theaters exist in for the selected state:

    { $project: { "location.address.city": 1 } },

  3. This list of cities (and IDs) is then passed to a sort stage, which sorts the data alphabetically by city name:

    { $sort: { "location.address.city": 1 } },

  4. Finally, the list is passed to a limit stage, outputting just the first three entries:

    { $limit: 3 }

Pretty simple, right? You can imagine how large and complex this pipeline could get in production, but one of its strengths is the ability to break down large pipelines into smaller subsections or individual stages. By looking at stages individually and sequentially, seemingly incomprehensible queries can become reasonably straightforward. It's also important to note that the order of the stages is just as important as the stages themselves, not only logically but also for performance. The $match and $project stages are placed early because they reduce the size of the result set passed to each subsequent stage. Although not applicable to every type of query, it is generally good practice to reduce the number of documents you are working with as early as possible, discarding any that would add unnecessary load on the server.
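The effect of stage ordering can be illustrated with a plain-JavaScript sketch (not MongoDB itself), using filter and slice as stand-ins for $match and $limit on a tiny hand-made dataset:

```javascript
// Plain-JavaScript illustration (not MongoDB) of why stage order matters.
var docs = [
    { title: "A", rating: 3 },
    { title: "B", rating: 9 },
    { title: "C", rating: 8 },
    { title: "D", rating: 7 }
];

// { $limit: 2 } followed by { $match: { rating: { $gte: 7 } } }:
// only "A" and "B" survive the limit, so the match sees just those two.
var limitFirst = docs.slice(0, 2).filter(function(d) { return d.rating >= 7; });

// { $match: ... } followed by { $limit: 2 }:
// the match sees every document, and the limit trims the full match set.
var matchFirst = docs.filter(function(d) { return d.rating >= 7; }).slice(0, 2);

// limitFirst contains only "B"; matchFirst contains "B" and "C".
```

The same two stages, in a different order, return different results, which is exactly the trap explored in Exercise 7.01.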

Although the pipeline structure itself is simple, there are more complex stages and operators required to accomplish advanced aggregations, as well as optimize them. We'll look at many of these over the next few topics.

Exercise 7.01: Performing Simple Aggregations

Before we begin this exercise, let's revisit the movie company from the scenario outlined in the Introduction in which a cinema company runs the classic movie marathon every year. In previous years, they have used a manual process for several subcategories before finally merging all the data by hand. As part of your initial research for this task, you are going to try to recreate one of their smaller manual processes as a MongoDB aggregation. This task will make you more familiar with the dataset and create a foundation for more complex queries.

The process you have decided to recreate is as follows:

"Return the top three movies in the romance genre sorted by IMDb rating, and return only movies released before 2001."

This can be done by executing the following steps:

  1. Translate your query into sequential stages that you can map to your aggregation stages: limit to three movies, match only romance movies, sort by IMDb rating, and match only movies released before 2001.
  2. Simplify your stages wherever possible by merging duplicate stages. In this case, you can merge the two match stages: limit to three movies, sort by IMDb rating, and match romance movies released before 2001.

    It's important to remember that the order of the stages matters: as written, these stages will produce incorrect results until we rearrange them. To demonstrate this in action, we'll leave them in the incorrect order for now.

  3. Take a quick peek into the structure of the movie documents to help write the stages:

    db.movies.findOne();

    The document appears as follows:

    Figure 7.4: Looking at the document structure (output truncated for brevity)


    For this particular use case, you will need the imdb.rating, released, and genres fields. Now that you know what you're searching for, you can begin writing up your pipeline.

  4. Create a file called Ch7_Exercise1.js and add the following basic stages: limit to limit the output to three movies, sort to sort them by their rating, and match to make sure you only find romantic movies released before 2001:

    // Ch7_Exercise1.js

    var findTopRomanceMovies = function() {

        print("Finding top Classic Romance Movies...");

        var pipeline = [

            { $limit: 3 }, // Limit to 3 results.

            { $sort: {"imdb.rating": -1}}, // Sort by IMDB rating.

            { $match: {. . .}}

        ];

        db.movies.aggregate(pipeline).forEach(printjson);

    };

    findTopRomanceMovies();

    The $match operator functions very similarly to the filter parameter in the find command. You can simply pass in two conditions instead of one.

  5. For the older than 2001 condition, use the $lte operator:

    // Ch7_Exercise1.js

    var findTopRomanceMovies = function() {

        print("Finding top Classic Romance Movies...");

        var pipeline = [

            { $limit: 3 }, // Limit to 3 results.

            { $sort: {"imdb.rating": -1}}, // Sort by IMDB rating.

            { $match: {

                genres: {$in: ["Romance"]}, // Romance movies only.

                released: {$lte: new ISODate("2001-01-01T00:00:00Z") }}},

        ];

        db.movies.aggregate(pipeline).forEach(printjson);

    };

    findTopRomanceMovies();

    Because the genres field is an array (movies can belong to multiple genres), you must use the $in operator to find arrays containing your desired value.

  6. Run this pipeline now; you may notice that it returns no documents:

    MongoDB Enterprise atlas-nb3biv-shard-0:PRIMARY> findTopRomanceMovies();

    Finding top Classic Romance Movies...

    MongoDB Enterprise atlas-nb3biv-shard-0:PRIMARY>

    Is it possible that no documents satisfy this query? Of course, there may be no movies that satisfy all these requirements. However, as you may have already guessed, that is not the case here. As stated earlier, it's the order of this pipeline that's producing misleading results. Because your limit stage is the first stage in your pipeline, you are only ever looking at three documents, and the subsequent stages don't have enough data to find a match. Therefore, it is always important to remember:

    When writing aggregation pipelines, the order of operations matters.

    So, rearrange them to make sure that you only limit your documents at the end of your pipeline. Thanks to the array-like structure of the command, this is quite easy: just cut the limit stage and paste it at the end of your pipeline.

  7. Arrange the stages so that the limit occurs last and does not produce incorrect results:

    // Our new pipeline.

    var pipeline = [

        { $sort: {"imdb.rating": -1}}, // Sort by IMDB rating.

        { $match: {

            genres: {$in: ["Romance"]}, // Romance movies only.

            released: {$lte: new ISODate("2001-01-01T00:00:00Z") }}},

        { $limit: 3 }, // Limit to 3 results (last stage)

    ];

  8. Rerun this after the change. This time, the documents are returned:
    Figure 7.5: Output with valid document return (output truncated for brevity)


    This is one of the challenges of writing aggregation pipelines: it is an iterative process and can be cumbersome when dealing with large numbers of complex documents.

    One way to relieve this pain point is to add stages during development that simplify the data, and then to remove these stages in your final query. In this case, you will add a stage to project only the data you're querying on. This will make it easier to tell whether you're capturing the right conditions. You must be careful when doing this that you do not affect the results of the query. We will discuss this in more detail later in this chapter. For now, you can simply add the projection stage right at the end to ensure that it will not interfere with your query.

  9. Add a projection stage at the end of the pipeline to help debug your query:

    var pipeline = [

        { $sort: {"imdb.rating": -1}}, // Sort by IMDB rating.

        { $match: {

        genres: {$in: ["Romance"]}, // Romance movies only.

        released: {$lte: new ISODate("2001-01-01T00:00:00Z") }}},

        { $limit: 3 }, // Limit to 3 results.

        { $project: { genres: 1, released: 1, "imdb.rating": 1}}

    ];

  10. Run this query again and you will see a much shorter, more easily understood output, as shown in the following figure:
    Figure 7.6: Output for the preceding snippet


If you're running the code from a file on your desktop, remember that you can simply copy and paste the entire code snippet (as follows) directly into your shell:

// Ch7_Exercise1.js

var findTopRomanceMovies = function() {

    print("Finding top Classic Romance Movies...");

       var pipeline = [

        { $sort: {"imdb.rating": -1}}, // Sort by IMDB rating.

        { $match: {

            genres: {$in: ["Romance"]}, // Romance movies only.

            released: {$lte: new ISODate("2001-01-01T00:00:00Z") }}},

        { $limit: 3 }, // Limit to 3 results.

        { $project: { genres: 1, released: 1, "imdb.rating": 1}}

];

    db.movies.aggregate(pipeline).forEach(printjson);

}

findTopRomanceMovies();

The output should be as follows:

Figure 7.7: List of the top classic romance movies released before 2001


You can also see that each of the returned movies is in the romance category, was released before 2001, and has a high IMDb rating. So, in this exercise, you have successfully created your first aggregation pipeline. Now, let's take the pipeline we just completed and try to improve it with a little effort. It is often helpful, when you believe you have completed a pipeline, to ask yourself:

"Can I reduce the number of documents being passed down the pipeline?"

In the next exercise, we will try to answer this question.

Exercise 7.02: Aggregation Structure

Think of the pipeline as a multi-tiered funnel. It starts broad at the top and becomes thinner as it approaches the bottom. As you pour documents into the top of the funnel, there are many documents, but as you move further down, this number keeps reducing at every stage, until only the documents that you want as output exit at the bottom. Usually, the easiest way to accomplish this is to do your matching (filtering) first.

In this pipeline, you will sort all the documents in the collection, and discard the ones that don't match. You are currently sorting documents you don't need. Swap those stages around:

  1. Swap the match and sort stages to improve the efficiency of your pipeline:

    var pipeline = [

        { $match: {

            genres: {$in: ["Romance"]}, // Romance movies only.

            released: {$lte: new ISODate("2001-01-01T00:00:00Z") }}},

        { $sort: {"imdb.rating": -1}}, // Sort by IMDB rating.

        { $limit: 3 }, // Limit to 3 results.

        { $project: { genres: 1, released: 1, "imdb.rating": 1}}

    ];

    Another thing to consider is that, although you do have a list of movies matching the criteria, you want your result to be meaningful to your use case. In this case, you want your result to be meaningful and useful to the movie company looking at this data. It is likely that they will care most about the movie title and rating. They may also wish to see that the movie matches their requirements, so let's project those out at the end as well, discarding all other attributes.

  2. Add the movie title field to your projection stage. Your final aggregation should look like this:

    // Ch7_Exercise2.js

    var findTopRomanceMovies = function() {

        print("Finding top Classic Romance Movies...");

        var pipeline = [

            { $match: {

                genres: {$in: ["Romance"]}, // Romance movies only.

                released: {$lte: new ISODate("2001-01-01T00:00:00Z") }}},

            { $sort: {"imdb.rating": -1}}, // Sort by IMDB rating.

            { $limit: 3 }, // Limit to 3 results.

            { $project: { title: 1, genres: 1, released: 1, "imdb.rating": 1}}

        ];

        db.movies.aggregate(pipeline).forEach(printjson);

    }

    findTopRomanceMovies();

  3. Rerun your pipeline by copying and pasting the code from step 2 into your mongo shell. You should see that the top two movies are Pride and Prejudice and Forrest Gump:
    Figure 7.8: Output for preceding snippet


If you see these results, you have just optimized your first aggregation pipeline.

As you can see, the aggregation pipeline is flexible, robust, and easy to manipulate, but you may be thinking that it seems a little heavy-duty for this use case and that possibly a simple find command might do the trick in most cases. Indeed, the aggregation pipeline is not needed for every simple query, but you're just getting started. In the next few sections, you'll see what the aggregate command provides that the find command does not.

Manipulating Data

Most of our activities and examples so far can be reduced to the following: given the documents in a collection, return some or all of them in an easy-to-digest format. At their core, the find command and aggregation pipeline are just about identifying and fetching the correct documents. However, the capability of the aggregation pipeline is much more robust and broader than that of the find command.

Using some of the more advanced stages and techniques in the pipeline allows us to transform our data, derive new data, and generate insights across a broader scope. This more extensive implementation of the aggregate command is more common than merely rewriting a find command as a pipeline. If you want to answer complex questions or extract the highest possible value from your data, you'll need to know how to achieve the aggregation part of your aggregation pipelines.

After all, we haven't even begun to aggregate any data yet. In this topic, we'll explore the basics of how you can begin to transform and aggregate your data.

The Group Stage

As you may expect from the name, the $group stage allows you to group (or aggregate) documents based on a specific condition. Although there are many other stages and methods to accomplish various tasks with the aggregate command, the $group stage serves as the cornerstone of the most powerful queries. Previously, the most significant unit of data we could return was a single document. We can sort these documents to gain insight through a direct comparison of the documents. However, once we master the $group stage, we will be able to increase the scope of our queries to an entire collection by aggregating our documents into large logical units. Once we have the larger groups, we can apply our filters, sorts, limits, and projections just as we did on a per-document basis.

The most basic implementation of a $group stage accepts only an _id key, with the value being an expression. This expression defines the criteria by which the pipeline groups documents together. This value becomes the _id of the newly outputted document with one document generated for each unique _id that the $group stage creates. For example, the following code will group all movies by their rating, outputting a single record for each rating category:

    var pipeline = [

     {$group: {

         _id: "$rated"

     }}

    ];

    db.movies.aggregate(pipeline).forEach(printjson);

The resultant output will be as follows:

Figure 7.9: Resultant output for preceding snippet


The first thing you may notice in our $group stage is the $ notation before the rated field. As stated previously, the value of our _id key was an expression. In aggregation terms, an expression can be a literal, an expression object, an operator, or a field path. In this case, we are passing in a field path, which tells the pipeline which field to access in the input documents. You may or may not have run into field paths before in MongoDB.

You may be wondering why we can't just pass the field name as we would in a find command. This is because, when aggregating, we need to tell the pipeline that we want to access the value of that field in the document it is currently aggregating. The $group stage will interpret _id: "$rated" as equivalent to _id: "$$CURRENT.rated". In other words, each document is placed into the group whose _id matches that (current) document's rated value. This will become clearer with practice in the next section.
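As an illustration only (MongoDB resolves field paths internally; resolvePath is a name we made up for this sketch), a field path such as "$imdb.rating" can be thought of as a dotted lookup into the current document:

```javascript
// Illustrative sketch of field-path resolution: drop the leading "$"
// and walk the dotted path into the document.
function resolvePath(doc, fieldPath) {
    var parts = fieldPath.slice(1).split(".");
    return parts.reduce(function(value, key) {
        return value == null ? undefined : value[key];
    }, doc);
}

var movie = { title: "Example", rated: "PG", imdb: { rating: 8.1 } };
resolvePath(movie, "$rated");       // "PG"
resolvePath(movie, "$imdb.rating"); // 8.1
```

This is why the $ prefix is needed: it distinguishes "look up this field in the current document" from the literal string "rated".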

So far, grouping by a single field has been useful to get a list of unique values. However, this hasn't told us much more about our data. We want to know more about these distinct groups; for example, how many titles fit into each of these groups? This is where our accumulator expressions will come in handy.

Accumulator Expressions

The $group command can accept more than just one argument. It can also accept any number of additional arguments in the following format:

field: { accumulator: expression},

Let's break this down into its three components:

  • field will define the key of our newly computed field for each group.
  • accumulator must be a supported accumulator operator. These are a group of operators, like other operators you may have worked with already – such as $lte – except, as the name suggests, they will accumulate their value across multiple documents belonging to the same group.
  • expression, in this context, is passed to the accumulator operator to tell it which value in each document it should accumulate.

Building on the previous example, let's identify the total number of movies in each group:

    var pipeline = [

     {$group: {

         _id: "$rated",

         "numTitles": { $sum: 1},

     }}

    ];

    db.movies.aggregate(pipeline).forEach(printjson);

You can see from this that we can create a new field called numTitles, whose value for each group is a count of the documents in that group. These newly created fields are often referred to as computed fields. For each document in a group, we sum the literal value 1 with the accumulated result so far. Running this in the MongoDB shell will give us the following results:

Figure 7.10: Output for preceding snippet
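The counting behavior of { $sum: 1 } can be sketched in plain JavaScript; the documents below are hypothetical stand-ins:

```javascript
const docs = [
  { title: "Movie A", rated: "PG" },
  { title: "Movie B", rated: "R" },
  { title: "Movie C", rated: "PG" },
];

// { $sum: 1 } adds the literal 1 to the accumulator once per document in the group.
const counts = {};
for (const d of docs) {
  counts[d.rated] = (counts[d.rated] || 0) + 1;
}

// Shape the totals like the pipeline's output documents.
const result = Object.entries(counts).map(([rated, n]) => ({ _id: rated, numTitles: n }));

console.log(result); // [ { _id: 'PG', numTitles: 2 }, { _id: 'R', numTitles: 1 } ]
```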

Similarly, instead of accumulating 1 for each document, you can accumulate the value of a given field. For example, let's say we want to find the combined runtime of all the films in each rating category. We group by the rated field and accumulate the runtime of each film:

    var pipeline = [

     {$group: {

         _id: "$rated",

         "sumRuntime": { $sum: "$runtime"},

     }}

    ];

    db.movies.aggregate(pipeline).forEach(printjson);

Remember, we must prefix the runtime field with the $ symbol to tell MongoDB we are referring to the runtime value of each document we are accumulating. Our new result is as follows:

Figure 7.11: Output for preceding snippet

Although this is a simple example, you can see that with just a single aggregation stage and two parameters, we can begin to transform our data in exciting ways. Several accumulator operators can be combined and layered to generate much more complex and insightful information about groups. We will see some of these operators in the upcoming examples.

It's important to note that $sum is not the only accumulator, and we can also use other operators to transform data after accumulating it. Let's say we want to get the average runtime of the titles in each of our groups. We can change our $sum accumulator to $avg, which will return the average runtime across each group, so our pipeline becomes as follows:

    var pipeline = [

     {$group: {

        _id: "$rated",

        "avgRuntime": { $avg: "$runtime"},

     }}

    ];

    db.movies.aggregate(pipeline).forEach(printjson);

And our output becomes:

Figure 7.12: Average runtime values based on rating

These average runtime values, with their long decimal fractions, are not particularly readable. Let's add a $project stage that passes the runtime through the $trunc operator to give us an integer value:

    var pipeline = [

     {$group: {

         _id: "$rated",

         "avgRuntime": { $avg: "$runtime"},

     }},

     {$project: {

         "roundedAvgRuntime": { $trunc: "$avgRuntime"}

     }}

    ];

    db.movies.aggregate(pipeline).forEach(printjson);

This will give us a much more nicely formatted result, like this:

{ "_id" : "PG-13", "roundedAvgRuntime" : 108 }
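The $avg and $trunc combination behaves like computing an arithmetic mean and then truncating toward zero. Here is a small sketch with made-up runtimes:

```javascript
// Hypothetical runtimes (in minutes) for the films in one rating group.
const runtimes = [93, 121, 111];

// { $avg: "$runtime" } accumulates the arithmetic mean of the values...
const avg = runtimes.reduce((sum, r) => sum + r, 0) / runtimes.length; // 108.33...

// ...and { $trunc: "$avgRuntime" } drops the fractional part, like Math.trunc.
const rounded = Math.trunc(avg);

console.log(rounded); // 108
```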

This section demonstrated how combining the group stage with operators, accumulators, and other stages can help manipulate our data to answer a much broader number of business questions. Now, let's start aggregating and put this new stage into practice.

Exercise 7.03: Manipulating Data

In the previous scenario, you became accustomed to the shape of the data and recreated one of the client's manual processes as an aggregation pipeline. As part of the lead-up to the classic movie marathon, the cinema company has decided to try and run one movie for each genre (one per week until the marathon), and they want to run the most popular genres last to build hype around the event. However, they have a problem. Their schedule for these weeks has already been dictated, meaning the classic movies will have to fit into the gaps in the schedule. To accomplish this, they must know the length of the longest movie in each genre, allowing time for trailers before each film.

Note

In this scenario, popularity is defined by the IMDb rating, and trailers run for 12 minutes before any film.

The aim can be summarized as follows:

"For only movies older than 2001, find the average and maximum popularity for each genre, sort the genres by popularity, and find the adjusted (with trailers) runtime of the longest movie in each genre."

Translate the query into sequential stages so that you can map to your aggregation stages:

  • Match movies that were released before 2001.
  • Find the average popularity of each genre.
  • Sort the genres by popularity.
  • Output the adjusted runtime of each movie.

Since you've learned more about the group stage, elaborate on that step using your new knowledge:

  • Match movies that were released before 2001.
  • Group all movies by their first genre and accumulate the average and maximum IMDb ratings.
  • Sort by the average popularity of each genre.
  • Project the adjusted runtime as total_runtime.

The following steps will help you complete this exercise.

  1. Create the outline for your aggregation first. Create a new file called Ch7_Exercise3.js:

    // Ch7_Exercise3.js

    var findGenrePopularity = function() {

      print("Finding popularity of each genre");

      var pipeline = [

                { $match: {}},

                { $group: {}},

                { $sort: {}},

                { $project: {}}

            ];

            db.movies.aggregate(pipeline).forEach(printjson);

        }

        findGenrePopularity();

  2. Fill in the steps one at a time, starting with $match:

                { $match: {

                    released: {$lte: new ISODate("2001-01-01T00:00:00Z") }}},

    This resembles Exercise 7.01, Performing Simple Aggregations, where you matched all the documents released before 2001.

  3. For the $group stage, first identify your new id for each output document:

    { $group: {

        _id: {"$arrayElemAt": ["$genres", 0]},

    }},

    The $arrayElemAt operator takes an element from an array at the specified index (in this case, 0). For this scenario, assume that the first genre in the array is the primary genre of a film.
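    The behavior of $arrayElemAt maps directly onto plain JavaScript array indexing; it also accepts negative indices, which count back from the end of the array:

```javascript
// Hypothetical genres array from a single movie document.
const genres = ["Drama", "Romance", "War"];

// { $arrayElemAt: ["$genres", 0] } -> the first element.
const first = genres[0];

// { $arrayElemAt: ["$genres", -1] } -> the last element (negative index).
const last = genres[genres.length - 1];

console.log(first, last); // Drama War
```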

    Next, specify the new computed fields you require in the result. Remember to use the accumulator operators, including $avg (average) and $max (maximum). Also remember that inside an accumulator, because you are referencing a field of each document, you must prefix the field name with the $ notation:

    { $group: {

        _id: {"$arrayElemAt": ["$genres", 0]},

        "popularity": { $avg: "$imdb.rating"},

        "top_movie": { $max: "$imdb.rating"},

        "longest_runtime": { $max: "$runtime"}

    }},

  4. Fill in the sort field. Now that you have defined your computed fields, this is simple:

    { $sort: { popularity: -1}},

  5. To get the adjusted runtime, use the $add operator and add 12 (minutes). You add 12 minutes because the client (the cinema company) has informed you that this is the length of the trailers running before each movie. Once you have the adjusted runtime, you will no longer need longest_runtime:

    { $project: {

        _id: 1,

        popularity: 1,

        top_movie: 1,

        adjusted_runtime: { $add: [ "$longest_runtime", 12 ] } } }

  6. Putting all the stages together, your final aggregation pipeline should look like this:

    var findGenrePopularity = function() {

        print("Finding popularity of each genre");

        var pipeline = [

            { $match: {

            released: {$lte: new ISODate("2001-01-01T00:00:00Z") }}},

            { $group: {

                _id: {"$arrayElemAt": ["$genres", 0]},

                "popularity": { $avg: "$imdb.rating"},

                "top_movie": { $max: "$imdb.rating"},

                "longest_runtime": { $max: "$runtime"}

            }},

                { $sort: { popularity: -1}},

                { $project: {

                    _id: 1,

                    popularity: 1,

                    top_movie: 1,

                    adjusted_runtime: { $add: [ "$longest_runtime", 12 ] } } }

            ];

            db.movies.aggregate(pipeline).forEach(printjson);

        }

        findGenrePopularity();

    If your results are correct, your top few documents should be as follows:

    Figure 7.13: Top few documents returned

The output shows that noir films, documentaries, and short films are the most popular, and we can also see the adjusted runtime of the longest movie in each category. In the next exercise, we will select a title from each category based on certain requirements.

Exercise 7.04: Selecting the Title from Each Movie Category

You have now answered the question posed to you by your client. However, this result won't aid them in picking a specific movie. They must execute a different query to get a list of movies in each genre and pick the best movie to show from the list. Additionally, you have also learned that the maximum time slot available is 230 minutes. You will alter this query to offer the cinema company a recommended title to choose in each category. The following steps will help you complete this exercise:

  1. First, extend the first $match stage to filter out films that aren't applicable. Filter out films longer than 218 minutes (the 230-minute slot minus 12 minutes of trailers). Also filter out films with a low rating; to begin, keep only movies with a rating of at least 7.0:

    { $match: {

      released: {$lte: new ISODate("2001-01-01T00:00:00Z") },

      runtime: {$lte: 218},

      "imdb.rating": {$gte: 7.0}

      }

    },

  2. To get the recommended title for each category, use the $first accumulator in our group stage to get the top document (movie) for each genre. To do this, you will have to first sort by rating in descending order, ensuring that the first document is also the highest rated. Add a new $sort stage after the initial $match stage:

    { $sort: {"imdb.rating": -1}},

  3. Now, add the $first accumulator to your group stage, adding your new fields. Also add recommended_rating and recommended_raw_runtime fields for ease of use:

    { $group: {

      _id: {"$arrayElemAt": ["$genres", 0]},

      "recommended_title": {$first: "$title"},

      "recommended_rating": {$first: "$imdb.rating"},

      "recommended_raw_runtime": {$first: "$runtime"},

      "popularity": { $avg: "$imdb.rating"},

      "top_movie": { $max: "$imdb.rating"},

      "longest_runtime": { $max: "$runtime"}

    }},

  4. Ensure that you add this new field to your final projection:

    { $project: {

         _id: 1,

          popularity: 1,

          top_movie: 1,

          recommended_title: 1,

          recommended_rating: 1,

          recommended_raw_runtime: 1,

          adjusted_runtime: { $add: [ "$longest_runtime", 12 ] } } }

    Your new final query should look like this:

    // Ch7_Exercise4.js

    var findGenrePopularity = function() {

        print("Finding popularity of each genre");

        var pipeline = [

           { $match: {

            released: {$lte: new ISODate("2001-01-01T00:00:00Z") },

                runtime: {$lte: 218},

                "imdb.rating": {$gte: 7.0}

                }

               },

               { $sort: {"imdb.rating": -1}},

               { $group: {

                 _id: {"$arrayElemAt": ["$genres", 0]},

                 "recommended_title": {$first: "$title"},

                 "recommended_rating": {$first: "$imdb.rating"},

                 "recommended_raw_runtime": {$first: "$runtime"},

                 "popularity": { $avg: "$imdb.rating"},

                 "top_movie": { $max: "$imdb.rating"},

                 "longest_runtime": { $max: "$runtime"}

               }},

               { $sort: { popularity: -1}},

               { $project: {

                    _id: 1,

                     popularity: 1,

                     top_movie: 1,

                     recommended_title: 1,

                     recommended_rating: 1,

                     recommended_raw_runtime: 1,

                     adjusted_runtime: { $add: [ "$longest_runtime", 12 ] } } }

            ];

            db.movies.aggregate(pipeline).forEach(printjson);

        }

        findGenrePopularity();

  5. Execute this, and your first two result documents should look something like the following:
    Figure 7.14: First two result documents

You can see that with a few additions to your pipeline, you have extracted the movies with the highest ratings and longest runtimes to create extra value for your client.

Earlier in this chapter, we saw how to query data and then sort, limit, and project our results. In this topic, we saw that by using more advanced aggregation stages, we can accomplish much more complicated tasks: data is manipulated and transformed to create new, meaningful documents. These new stages empower the user to answer a much broader range of difficult business questions and to gain valuable insight into datasets.

Working with Large Datasets

So far, we've been working with a relatively small number of documents. The movies collection has roughly 23,500 documents in it. This may be a considerable number for a human to work with, but for large production systems, you may be working on a scale of millions instead of thousands. So far, we have also been focusing strictly on a single collection at a time, but what if the scope of our aggregation grows to include multiple collections?

In the first topic, we briefly discussed how you could use the projection stage while developing your pipelines to create more readable output as well as simplify your results for debugging. However, we didn't cover how you can improve performance when working on much, much larger datasets, both while developing and for your final production-ready queries. In this topic, we'll discuss a few of the aggregation stages that you need to master when working with large, multi-collection datasets.

Sampling with $sample

The first step in learning how to deal with large datasets is understanding $sample. This stage is simple yet useful. The only parameter to $sample is the desired size of your sample. This stage randomly selects documents (up to your specified size) and passes them through to the next stage:

{ $sample: {size: 100}}, // This will reduce the scope to 100 random docs.

By doing this, you can significantly reduce the number of documents going through your pipeline. This is useful primarily for two reasons. The first is to speed up execution time when running against enormous datasets, mainly while you are fine-tuning or building your aggregation. The second is for queries where the use case can tolerate documents missing from the result. For example, if you want to return any five films in a genre, you can use $sample:

var findWithSample = function() {

    print("Finding all documents WITH sampling")

    var now = Date.now();

    var pipeline = [

        { $sample: {size: 100}},

        { $match: {

            "plot": { $regex: /around/}

        }}

    ];

    db.movies.aggregate(pipeline)

    var duration = Date.now() - now;

    print("Finished WITH sampling in " + duration+"ms");

}

findWithSample();

The following result will be achieved after executing your new findWithSample() function:

Finding all documents WITH sampling

Finished WITH sampling in 194ms

You may be wondering why you wouldn't just use a $limit command to achieve the same result of reducing the number of documents at some stage in your pipeline. The primary reason is that $limit always respects the order of the documents and thus returns the same documents every time. However, it is important to note that in some cases, where you do not require the pseudo-random selection of $sample, it is wiser to use $limit.
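To make the selection behavior concrete, here is a plain JavaScript sketch of pseudo-random selection without replacement. The server's actual sampling strategy differs (and depends on collection size), so treat this only as an illustration:

```javascript
// Illustrative sampling without replacement, in the spirit of $sample.
function sample(docs, size) {
  const pool = [...docs]; // copy so the input is untouched
  const picked = [];
  while (picked.length < size && pool.length > 0) {
    const i = Math.floor(Math.random() * pool.length);
    picked.push(pool.splice(i, 1)[0]); // remove the pick so no document repeats
  }
  return picked;
}

const docs = Array.from({ length: 10 }, (_, n) => ({ _id: n }));
console.log(sample(docs, 3).length); // 3
```

Because the selection is random, repeated runs return different documents, which is exactly the behavior that distinguishes $sample from $limit.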

For comparison, here is the same query, searching all movies for a specific keyword in the plot field, implemented without $sample:

var findWithoutSample = function() {

    print("Finding all documents WITHOUT sampling")

    var now = Date.now();

    var pipeline =[

        { $match: {

            "plot": { $regex: /around/}

        }},

    ]

    db.movies.aggregate(pipeline)

    var duration = Date.now() - now;

    print("Finished WITHOUT sampling in " + duration+ "ms");

}

findWithoutSample();

The preceding example is not the best way to measure performance, and there are much better ways to analyze the performance of your pipelines, such as Explain. However, since we'll cover those in later parts of this book, this will serve as a simple example. If you run this little script, you will get the following result consistently:

Finding all documents WITHOUT sampling

Finished WITHOUT sampling in 862ms

A simple comparison of the two outputs of these two commands is as follows:

Finding all documents WITH sampling 

Finished WITH sampling in 194ms 

Finding all documents WITHOUT sampling

Finished WITHOUT sampling in 862ms

With sampling, the performance is significantly improved. However, this is because we are only looking at 100 documents. More likely, in this case, we want to sample our result after the match statement to make sure we don't exclude all our results in the first stage. In most scenarios, when working on large datasets where the execution time is significant, you may want to sample at the beginning as you construct your pipeline and remove the sample once your query is finalized.

Joining Collections with $lookup

Sampling may assist you when developing queries against extensive collections, but in production, you may sometimes need to write queries that operate across multiple collections. In MongoDB, these collection joins are done using the $lookup aggregation stage.

These joins can be easily understood by the following aggregation:

var lookupExample = function() {

    var pipeline = [

        { $match: { $or: [{"name": "Catelyn Stark"}, {"name": "Ned Stark"}]}},

        { $lookup: {

            from: "comments",

            localField: "name",

            foreignField: "name",

            as: "comments"

        }},

  { $limit: 2},

    ];

    db.users.aggregate(pipeline).forEach(printjson);

}

lookupExample();

Let's dissect this before we try to run it. First, we are running a $match against the users collection to get only two users named Ned Stark and Catelyn Stark. Once we have these two records, we perform our lookup. The four parameters of $lookup are as follows:

  • from: The collection we are joining to our current aggregation. In this case, we are joining comments to users.
  • localField: The field name that we are going to use to join our documents in the local collection (the collection we are running the aggregation on). In this case, the name of our user.
  • foreignField: The field that links to localField in the from collection. These may have different names, but in this scenario, it is the same field: name.
  • as: This is how our new joined data will be labeled.

In this example, the lookup takes the name of our user, searches the comments collection, and adds any comments with the same name into a new array field for the original user document. This new array is called comments. In this way, we can fetch an array of all related documents in another collection and embed them in our original documents for use in the rest of our aggregation.
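The equality-match form of $lookup behaves like the following plain JavaScript sketch (with hypothetical stand-in documents):

```javascript
// Hypothetical stand-ins for the users and comments collections.
const users = [{ name: "Ned Stark" }, { name: "Catelyn Stark" }];
const comments = [
  { name: "Ned Stark", text: "first comment" },
  { name: "Catelyn Stark", text: "second comment" },
  { name: "Ned Stark", text: "third comment" },
];

// For each local document, gather the foreign documents whose foreignField
// ("name") equals the localField ("name") and attach them under the "as" key.
const joined = users.map((u) => ({
  ...u,
  comments: comments.filter((c) => c.name === u.name),
}));

console.log(joined[0].comments.length); // 2
```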

If we were to run the pipeline as it is, the beginning of the output would look something like this:

Figure 7.15: Output after running the pipeline (truncated for brevity)

Because the output is very large, the preceding screenshot shows only the start of the comments array.

In this example, users have made many comments, so the embedded array becomes quite substantial and challenging to view. This presents an excellent place to introduce the $unwind stage, as these joins often result in large arrays of related documents. $unwind is a relatively simple stage: it deconstructs an array field from an input document, outputting a new document for each element in the array. For example, if you unwind this document:

{a: 1, b: 2, c: [1, 2, 3, 4]}

The output will be the following documents:

{"a" : 1, "b" : 2, "c" : 1 }

{"a" : 1, "b" : 2, "c" : 2 }

{"a" : 1, "b" : 2, "c" : 3 }

{"a" : 1, "b" : 2, "c" : 4 }
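The same transformation can be sketched in plain JavaScript: one output document per array element, with the array field replaced by that element:

```javascript
const doc = { a: 1, b: 2, c: [1, 2, 3, 4] };

// Equivalent of { $unwind: "$c" } for a single input document.
const unwound = doc.c.map((el) => ({ ...doc, c: el }));

console.log(unwound[0]); // { a: 1, b: 2, c: 1 }
console.log(unwound.length); // 4
```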

We can add this new stage to our join and try running it:

var lookupExample = function() {

    var pipeline = [

        { $match: { $or: [{"name": "Catelyn Stark"}, {"name": "Ned Stark"}]}},

        { $lookup: {

            from: "comments",

            localField: "name",

            foreignField: "name",

            as: "comments"

        }},

        { $unwind: "$comments"},

        { $limit: 3},

    ];

    db.users.aggregate(pipeline).forEach(printjson);

}

lookupExample();

We will see output like this:

Figure 7.16: Output for preceding snippet (truncated for brevity)

We can see multiple documents per user, with a single document for each comment instead of one embedded array. With this new format, we can add more stages to operate on our new set of documents. For example, we may wish to filter out any comments on a specific movie or sort the comments by their date. Together, $lookup and $unwind make a powerful combination for answering complex questions across multiple collections in a single aggregation.

Outputting Your Results with $out and $merge

Imagine that we've been working on a large, multi-stage aggregation pipeline over the last week. We have been debugging, sampling, filtering, and testing our pipeline to solve a challenging and complex business problem on a tremendously large dataset. We're finally happy with our pipeline, and we want to execute it and then save the results for subsequent analysis and presentation.

We could run the query and export the results into a new format. However, this would mean re-importing the results if we wanted to run subsequent analysis on the result set.

We could save the output in an array and then re-insert it into MongoDB, but that would mean transferring all the data from the server to the client, and then back from the client to the server.

Luckily for us, from MongoDB version 4.2 onward, we are provided with two aggregation stages that solve this problem for us: $out and $merge. Both stages allow us to take the output from our pipeline and write it into a collection for later use. Importantly, this whole process takes place on the server, meaning that all the data never needs to be transferred to the client across the network. It's not hard to imagine that after creating a complicated aggregation query, you may want to run it once a week and create a snapshot of your result by writing that data into a collection.

Let's look at the syntax of both these stages in their most basic form, and then we can compare how they function:

// Available from v2.6

{ $out: "myOutputCollection"}

// Available from version 4.2

{ $merge: {

    // This can also accept {db: <db>, coll: <coll>} to merge into a different db

    into: "myOutputCollection",

}}

As you can see, the syntax without any optional parameters is almost identical. In most other regards, however, the two stages diverge. $out is very simple; the only parameter to specify is the desired output collection. It will either create a new collection or completely replace an existing one. $out also has several constraints not shared by $merge. For example, $out must output to the same database as the aggregation target.
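For reference, $merge also accepts optional parameters that control how incoming documents are combined with existing ones. Here is a sketch of the fuller form (the database name is a made-up example; the option names and values are from the MongoDB 4.2 documentation):

```javascript
// Fuller $merge form (MongoDB 4.2+); "on" defaults to _id.
const mergeStage = {
  $merge: {
    into: { db: "reporting", coll: "myOutputCollection" }, // may target another db
    on: "_id",                 // unique field(s) used to match existing documents
    whenMatched: "merge",      // or "replace", "keepExisting", "fail", or a pipeline
    whenNotMatched: "insert",  // or "discard", "fail"
  },
};

console.log(mergeStage.$merge.whenMatched); // merge
```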

When running on a MongoDB 4.2 server, $merge will probably be the better option. However, for the scope of this book, we will be using the MongoDB free tier, which runs MongoDB 4.0. Therefore, we will focus more on the $out stage in these examples.

Here is an example of a pipeline that uses $out:

var findTopRomanceMovies = function() {

    var pipeline = [

        { $sort: {"imdb.rating": -1}}, // Sort by IMDB rating.

        { $match: {

            genres: {$in: ["Romance"]}, // Romance movies only.

            released: {$lte: new ISODate("2001-01-01T00:00:00Z") }}},

        { $limit: 5 },                 // Limit to 5 results.

        { $project: { title: 1, genres: 1, released: 1, "imdb.rating": 1}},

        { $out: "movies_top_romance"}

    ];

    db.movies.aggregate(pipeline).forEach(printjson);

}

findTopRomanceMovies();

By running this pipeline, you will receive no output. This is because the output has been redirected to our desired collection:

MongoDB Enterprise atlas-nb3biv-shard-0:PRIMARY> findTopRomanceMovies();

MongoDB Enterprise atlas-nb3biv-shard-0:PRIMARY>

We can see that a new collection was created with our result:

MongoDB Enterprise atlas-nb3biv-shard-0:PRIMARY> show collections

comments

movies

movies_top_romance

sessions

theaters

users

And if we run a find on our new collection, we can see that the results of our aggregation are now stored within it:

MongoDB Enterprise atlas-nb3biv-shard-0:PRIMARY> db.movies_top_romance.findOne({})

{

        "_id" : ObjectId("573a1399f29313caabceeead"),

        "genres" : [

                "Drama",

                "Romance"

        ],

        "title" : "Pride and Prejudice",

        "released" : ISODate("1996-01-14T00:00:00Z"),

        "imdb" : {

                "rating" : 9.1

        }

}

By placing our results into a collection, we can store, share, and update new complex aggregation results. We can even run further queries and aggregations against this new collection. $out is a simple but powerful aggregation stage.

Exercise 7.05: Listing the Most User-Commented Movies

The cinema company wishes to learn more about which movies generate the most comments from their users. However, given the large number of comments in the database (and your disposition to use your newly learned skills), you have decided that while developing this pipeline, you will use only a sample of the comments. From this sample, you will figure out the most talked-about movies and join these documents with the documents in the movies collection to get more information about each film. The company has also requested that the final deliverable of your work be a new collection containing the output documents. This requirement should be easy to satisfy given that you now know the $out stage.

Some additional information you have gathered is that they wish for the result to be as simple as possible and they wish to know the movie title and rating. Additionally, they would like to see the top five most commented-on movies.

In this exercise, you will help the cinema company to obtain a list of movies that generate the most comments from users. Perform the following steps to complete this exercise:

  1. First, outline the stages in your pipeline; they appear in the following order:

    $sample the comments collection (while building the pipeline).

    $group the comments by the movie for which they are targeted.

    $sort the result by the number of total comments.

    $limit the result to the top five movies by comments.

    $lookup the movie that matches each document.

    $unwind the movie array to keep the result documents simple.

    $project just the movie title and rating.

    $merge the result into a new collection.

    Although this may seem like many stages, each stage is relatively simple, and the process can be followed logically from beginning to end.

  2. Create a new file called Ch7_Exercise5.js and write up your pipeline skeleton:

    // Ch7_Exercise5.js

    var findMostCommentedMovies = function() {

        print("Finding the most commented on movies.");

        var pipeline = [

                 { $sample: {}},

                 { $group: {}},

                 { $sort: {}},

                 { $limit: 5},

                 { $lookup: {}},

                 { $unwind: ""},

                 { $project: {}},

                 { $out: ""}

        ];

        db.comments.aggregate(pipeline).forEach(printjson);

    }

    findMostCommentedMovies();

  3. Before deciding on sample size, you should get a sense of how large the comments collection is. Run count against the comments collection:

    MongoDB Enterprise atlas-nb3biv-shard-0:PRIMARY> db.comments.count()

    50303

  4. Sample roughly ten percent of the collection while you're developing. Set the sample size to 5000 for this exercise:

    { $sample: {size: 5000}},

  5. Now that you have the easier steps out of the way, fill in the $group statement to group the comments by their associated film, accumulating the total number of comments for each film:

    { $group: {

        _id: "$movie_id",

        "sumComments": { $sum: 1}

    }},

  6. Next, add a sort so that the movies with the highest sumComments value come first:

    { $sort: { "sumComments": -1}},

  7. When building pipelines, it's important to periodically run them in a partially complete state to make sure you see the results you're expecting. Since you're about halfway through the stages, quickly comment out the incomplete stages and run the aggregation to list your most commented-on movies. Keep in mind that because you are sampling, the results will not be the same each time you run your pipeline. The following output is just an example:
    Figure 7.17: Example output

    Our output will appear as follows:

    Figure 7.18: Output after running the aggregation pipeline (truncated for brevity)

    You now need to perform a lookup into the movies collection to match your comment groups with the movie documents:

    { $lookup: {

        from: "movies",

        localField: "_id",

        foreignField: "_id",

        as: "movie"

    }},

    Rerunning this, you can now see a movie array with all the movie details embedded within it:

    Figure 7.19: Output after re-running the pipeline

    There is only one movie in each movie array, so unwind those arrays to simplify the structure. Once it is unwound, you can project out all the fields you don't care to see. Now, fill in these two steps:

    { $unwind: "$movie" },

    { $project: {

        "movie.title": 1,

        "movie.imdb.rating": 1,

        "sumComments": 1,

    }}

  8. Your data is now complete, but you still need to output this result into a collection. Add the $out step at the end:

    { $out: "most_commented_movies" }

    Your final resulting code should look something like this:

    // Ch7_Exercise5.js

    var findMostCommentedMovies = function() {

        print("Finding the most commented on movies.");

        var pipeline = [

                 { $sample: {size: 5000}},

                 { $group: {

                     _id: "$movie_id",

                     "sumComments": { $sum: 1}

                 }},

                 { $sort: { "sumComments": -1}},

                 { $limit: 5},

                 { $lookup: {

                     from: "movies",

                     localField: "_id",

                     foreignField: "_id",

                     as: "movie"

                 }},

                 { $unwind: "$movie" },

                 { $project: {

                     "movie.title": 1,

                     "movie.imdb.rating": 1,

                     "sumComments": 1,

                 }},

                 { $out: "most_commented_movies" }

        ];

        db.comments.aggregate(pipeline).forEach(printjson);

    }

    findMostCommentedMovies();

    Run this code. If all goes well, you will notice no output from your pipeline in the shell, but you should be able to check your newly created collection using find() and see your result. Remember, due to your sampling stage, the results will not be the same every time:

    Figure 7.20: Results from preceding snippet (output truncated for brevity)


With the new phases we have learned about in this topic, we now possess an excellent foundation for performing aggregations on more massive, more complex datasets. Moreover, importantly, we are now able to join data between multiple collections effectively. By doing this, we can increase the scope of our queries and thus satisfy a much broader range of use cases.

With the out stage, we can store the result of our aggregations. This allows users to explore the results quickly with normal CRUD operations and allows us to keep updating the results regularly and easily. The unwind stage has also given us the ability to take the joined documents from a lookup and separate them into individual documents that we can feed into further pipeline stages.
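To make the unwind behavior concrete, here is a plain-JavaScript sketch (not MongoDB code) of what $unwind does to a single joined document; the document shape loosely mimics the $lookup output from the exercise above:

```javascript
// A plain-JavaScript model of $unwind applied to one joined document.
// The shape loosely mimics the $lookup output from the exercise above.
var joined = { _id: 1, sumComments: 42, movie: [{ title: "Example Movie" }] };

// $unwind emits one output document per array element, replacing the
// array field with that single element.
var unwound = joined.movie.map(function (m) {
  return { _id: joined._id, sumComments: joined.sumComments, movie: m };
});

console.log(unwound.length);          // 1 document, since the array had 1 element
console.log(unwound[0].movie.title);  // "Example Movie"
```

Each array element becomes its own document, which is why a one-element movie array collapses neatly into an embedded movie object that later stages can treat like any other field.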

With all these stages combined, we are now able to create extensive new aggregations that operate across large, multi-collection datasets.

Getting the Most from Your Aggregations

In the last three topics, we have learned about the structure of aggregation as well as the key stages required to build up complicated queries. We can search large multi-collection datasets with given criteria, manipulate that data to create new insights, and output our results into a new or existing collection.

These fundamentals will allow you to solve most of the problems you will encounter in an aggregation pipeline. However, there are several other stages and patterns for getting the most out of your aggregations. We won't cover them all in this book, but in this topic, we'll discuss a few odds and ends that will help you fine-tune your pipelines, along with some topics we simply haven't covered so far: tuning your pipelines, aggregation options, and using Explain to analyze your aggregations.

Tuning Your Pipelines

In an earlier topic, we timed the execution of our pipeline by outputting the time before and after our aggregation. This is a valid technique, and you may often time your MongoDB queries on the client or application side. However, this only gives us a rough approximation of duration and only tells us the total time the response took to reach the client, not how long the server took to execute the pipeline. MongoDB provides us with a great way of learning exactly how it executed our requested query. This feature is known as Explain and is the usual way to examine and optimize our MongoDB commands.
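As a reminder, the client-side timing technique referred to above looks roughly like the following sketch. The loop is only a stand-in for the aggregate call, and console.log is used in place of the mongo shell's print() so the snippet also runs under Node.js:

```javascript
// Client-side timing sketch: record wall-clock time around a call.
// This measures total round-trip time, not server execution time.
var start = new Date();

// ... your aggregation would run here, e.g. db.movies.aggregate(pipeline) ...
var total = 0;
for (var i = 0; i < 100000; i++) { total += i; } // stand-in work

var elapsedMs = new Date() - start;
console.log("Aggregation round trip took " + elapsedMs + " ms");
```

The number this prints includes network latency and client processing, which is precisely why it can only ever approximate how long the server itself spent executing the pipeline.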

However, there is one catch. Explain does not yet support detailed execution plans for aggregations, meaning its use is limited when it comes to the optimization of pipelines. Explain and execution plans will be covered in more detail later in this book. Since we can't rely on Explain to analyze our pipelines, it becomes even more important to construct and plan our pipelines carefully to improve the performance of our aggregations. Although there is no single correct method that will work in every situation, there are some heuristics that are generally helpful. We'll walk through a few of these methods with examples. MongoDB does a lot of performance optimization under the hood, but these are still good patterns to follow.

Filter Early and Filter Often

Each stage of the aggregation pipeline will perform some processing on the input. That means the more significant the input, the larger the processing. If you've designed your pipeline correctly, this processing is unavoidable for the documents you are trying to return. The best you can do is to make sure you're processing only the documents you want to return.

The easiest way to accomplish this is by adding or moving pipeline stages that filter out documents. We've already done this in our previous scenarios with $match and $limit. A common way to ensure this is to have the very first stage in your pipeline be a $match, which matches only documents you need later in the pipeline. Let's understand this with the help of the following pipeline example, where the pipeline is not designed to execute as expected:

var badlyOrderedQuery = function() {

  print("Running query in bad order.")

  var pipeline = [

    { $sort: {"imdb.rating": -1}}, // Sort by IMDB rating.

    { $match: {

        genres: {$in: ["Romance"]}, // Romance movies only.

        released: {$lte: new ISODate("2001-01-01T00:00:00Z") }}},

    { $project: { title: 1, genres: 1, released: 1, "imdb.rating": 1}},

    { $limit: 1 }, // Limit to 1 result.

  ];

  db.movies.aggregate(pipeline).forEach(printjson);

}

badlyOrderedQuery();

The output will be as follows:

MongoDB Enterprise atlas-nb3biv-shard-0:PRIMARY> badlyOrderedQuery();

Running query in bad order.

{

        "_id" : ObjectId("573a1399f29313caabceeead"),

        "genres" : [

                "Drama",

                "Romance"

        ],

        "title" : "Pride and Prejudice",

        "released" : ISODate("1996-01-14T00:00:00Z"),

        "imdb" : {

                "rating" : 9.1

        }

}

Once you have correctly ordered the pipeline, it will look like the following:

var wellOrderedQuery = function() {

  print("Running query in better order.")

  var pipeline = [

    { $match: {

        genres: {$in: ["Romance"]}, // Romance movies only.

        released: {$lte: new ISODate("2001-01-01T00:00:00Z") }}},

    { $sort: {"imdb.rating": -1}}, // Sort by IMDB rating.

    { $limit: 1 },                 // Limit to 1 result.

    { $project: { title: 1, genres: 1, released: 1, "imdb.rating": 1}},

  ];

  db.movies.aggregate(pipeline).forEach(printjson);

}

wellOrderedQuery();

This will result in the following output:

Figure 7.21: Output for preceding snippet (truncated for brevity)


Logically, this change means that the first thing we do is get the list of all our eligible documents before sorting them, and then we take the top result and project only that single document.

Both pipelines output the same results, but the second is much more robust and easily understood. You may not always see a significant performance increase with this change, particularly on smaller datasets. However, this is an excellent practice to follow because it will assist you in creating logical, efficient, and straightforward pipelines that can be modified or scaled more easily.
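To see why the ordering matters, consider a toy cost model written in plain JavaScript (not MongoDB itself): each stage is charged one unit of work per input document, so filtering before sorting shrinks the input that later stages see. The counts are illustrative only:

```javascript
// Toy cost model: charge each stage one unit of work per input document.
var docs = [];
for (var i = 0; i < 1000; i++) {
  docs.push({ genre: i % 10 === 0 ? "Romance" : "Other", rating: i % 100 });
}

function countProcessed(stages) {
  var current = docs, processed = 0;
  stages.forEach(function (stage) {
    processed += current.length; // every stage touches each input document
    current = stage(current);
  });
  return processed;
}

var match = function (d) { return d.filter(function (x) { return x.genre === "Romance"; }); };
var sort  = function (d) { return d.slice().sort(function (a, b) { return b.rating - a.rating; }); };
var limit = function (d) { return d.slice(0, 1); };

var badOrder  = countProcessed([sort, match, limit]);  // sort sees all 1,000 docs
var goodOrder = countProcessed([match, sort, limit]);  // sort sees only the 100 matches

console.log(badOrder, goodOrder); // the well-ordered pipeline does far less work
```

In this model the badly ordered pipeline touches 2,100 documents while the well-ordered one touches 1,200. On real collections the gap can matter far more, since a $sort over a huge input may also run into memory limits.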

Use Your Indexes

Indexes are another critical element in MongoDB query performance. This book covers indexes and their creation in further depth in Chapter 9, Performance, where the concepts around using them will become more apparent. For now, all you need to remember when creating your aggregations is that stages such as $sort and $match benefit significantly from operating on correctly indexed fields.
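For instance, if your pipeline sorts on imdb.rating, an index created with something like db.movies.createIndex({"imdb.rating": -1}) lets MongoDB walk documents in order instead of scanning and sorting everything. The payoff is loosely analogous to binary search versus a linear scan, as this plain-JavaScript comparison shows (the comparison counts, not the code, are the point):

```javascript
// Rough analogy: an index lets MongoDB use an ordered structure (a B-tree)
// rather than scanning every document, much like binary search on a sorted
// array versus a linear scan.
var ratings = [];
for (var i = 0; i < 1024; i++) ratings.push(i);

function linearScan(arr, target) {
  var comparisons = 0;
  for (var i = 0; i < arr.length; i++) {
    comparisons++;
    if (arr[i] === target) break;
  }
  return comparisons;
}

function binarySearch(arr, target) {
  var comparisons = 0, lo = 0, hi = arr.length - 1;
  while (lo <= hi) {
    comparisons++;
    var mid = (lo + hi) >> 1;
    if (arr[mid] === target) break;
    if (arr[mid] < target) lo = mid + 1; else hi = mid - 1;
  }
  return comparisons;
}

console.log(linearScan(ratings, 1000), binarySearch(ratings, 1000));
```

A scan must touch roughly every element on the way to a late match, while the ordered lookup needs only a handful of comparisons; the same intuition explains why $match and $sort on indexed fields scale so much better.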

Think about the Desired Output

One of the most important ways to improve your pipelines is to plan and evaluate them to ensure that you're getting the desired output that solves your business problem. Ask yourself the following questions if you're having trouble creating a finely tuned pipeline:

  • Am I outputting all the data to solve my problem?
  • Am I outputting only the data required to solve my problem?
  • Am I able to merge or remove any intermediate steps?

If you have evaluated your pipeline, tuned it, and still find it to be over-complicated or inefficient, you may need to ask some questions about the data itself. Is the aggregation difficult because the wrong query is being designed, or even the wrong question being asked? Alternatively, perhaps it is a sign that the shape of the data needs to be re-evaluated.

Aggregation Options

Altering the pipeline is where you may spend most of your time while working with aggregations, and for beginners, you will likely be able to accomplish most of your goals by just writing pipelines. As mentioned earlier in this chapter, several options can be passed into the aggregate command to configure its operation. We won't delve too deeply into these options, but it is helpful to recognize them. The following is an example of aggregation with some of our options included:

    var options = {

        maxTimeMS: 30000,

        allowDiskUse: true

        }

    db.movies.aggregate(pipeline, options);

To specify these options, a second parameter is passed into the command after the pipeline array. In this case, we've called it options. Some of the options to be aware of include the following:

  • maxTimeMS: The maximum amount of time an operation may run before MongoDB kills it. Essentially a timeout for your aggregation. The default is 0, which means operations do not time out.
  • allowDiskUse: Each stage in the aggregation pipeline may use only a limited amount of memory, making it challenging to handle massive datasets. Setting this option to true lets MongoDB write temporary files to disk so that more data can be handled.
  • bypassDocumentValidation: This option is specifically for pipelines that will be writing out to collections using $out or $merge. If this option is set to true, document validation will not occur on documents being written to the collection from this pipeline.
  • comment: This option is just for debugging and allows a string to be specified that helps identify this aggregation when parsing database logs.
Let's now perform an exercise to put the concepts we have learned so far into practice.

Exercise 7.06: Finding Award-Winning Documentary Movies

After seeing the results of the aggregation pipelines achieved in the previous exercises and the value they are bringing to the cinema company, a few of the company's internal engineers have tried to write up some new aggregations themselves. The cinema company has asked you to review these pipelines to assist in their internal engineers' learning process. You will use some of the preceding techniques and your understanding of aggregations from the last three topics to fix up a pipeline. The goal of this simple pipeline is to get a list of documentary movies with a high rating.

For this scenario, you will also work under the assumption that there is a substantial amount of data in the collection. The pipeline given to you to be reviewed is as follows. The purpose of this exercise is to find a few award-winning documentary movies and then list the movies that have won the most awards:

var findAwardWinningDocumentaries = function() {

    print("Finding award winning documentary Movies...");

    var pipeline = [

        { $sort: {"awards.wins": -1}}, // Sort by award wins.

        { $match: {"awards.wins": { $gte: 1}}},

        { $limit: 20}, // Get the top 20 movies with at least one award

        { $match: {

            genres: {$in: ["Documentary"]}, // Documentary movies only.

        }},

        { $project: { title: 1, genres: 1, awards: 1}},

        { $limit: 3},

    ];

    var options = { }

    db.movies.aggregate(pipeline, options).forEach(printjson);

}

findAwardWinningDocumentaries();

The result can be achieved through the following steps:

  1. First, merge the two $match statements and move match to the top of the pipeline:

    var pipeline = [

        { $match: {

            "awards.wins": { $gte: 1},

            genres: {$in: ["Documentary"]},

        }},

        { $sort: {"awards.wins": -1}}, // Sort by award wins.

        { $limit: 20}, // Get the top 20 movies.

        { $project: { title: 1, genres: 1, awards: 1}},

        { $limit: 3},

    ];

  2. sort is no longer needed at the beginning, so you can move it to the second-to-last step:

    var pipeline = [

        { $match: {

            "awards.wins": { $gte: 1},

            genres: {$in: ["Documentary"]},

        }},

        { $limit: 20}, // Get the top 20 movies.

        { $project: { title: 1, genres: 1, awards: 1}},

        { $sort: {"awards.wins": -1}}, // Sort by award wins.

        { $limit: 3},

    ];

  3. With the sort now near the end, there is no longer any need for two limits, and the first one would take an arbitrary 20 movies before sorting. Delete the first limit:

    var pipeline = [

        { $match: {

            "awards.wins": { $gte: 1},

            genres: {$in: ["Documentary"]},

        }},

        { $project: { title: 1, genres: 1, awards: 1}},

        { $sort: {"awards.wins": -1}}, // Sort by award wins.

        { $limit: 3},

    ];

  4. Finally, move the projection to the very end, as it only needs to operate on the final three documents:

    var pipeline = [

        { $match: {

            "awards.wins": { $gte: 1},

            genres: {$in: ["Documentary"]},

        }},

        { $sort: {"awards.wins": -1}}, // Sort by award wins.

        { $limit: 3},

        { $project: { title: 1, genres: 1, awards: 1}},

    ];

  5. That's already looking much better. You've been told that the collection is vast, so also add some options to the aggregation:

    var options ={

            maxTimeMS: 30000,

            allowDiskUse: true,

            comment: "Find Award Winning Documentary Films"

        }

        db.movies.aggregate(pipeline, options).forEach(printjson);

  6. Run the full query:

    var findAwardWinningDocumentaries = function() {

        print("Finding award winning documentary Movies...");

        var pipeline = [

            { $match: {

                "awards.wins": { $gte: 1},

                genres: {$in: ["Documentary"]},

            }},

            { $sort: {"awards.wins": -1}}, // Sort by award wins.

            { $limit: 3},

            { $project: { title: 1, genres: 1, awards: 1}},

        ];

        

        var options ={

            maxTimeMS: 30000,

            allowDiskUse: true,

            comment: "Find Award Winning Documentary Films"

        }

        db.movies.aggregate(pipeline, options).forEach(printjson);

    }

    findAwardWinningDocumentaries();

    So, your result should be as follows:

    Figure 7.22: List of award-winning documentaries (truncated for brevity)


With this, you have retrieved the award-winning documentary list as per your cinema company's requirements. We have seen in this topic that to get the most value from your aggregations, you will need to design, test, and continually re-evaluate your pipeline. The heuristics listed previously are just a small fraction of the patterns for designing useful aggregations, so researching other patterns and procedures is always recommended.

We also saw how we could pass in some options to the aggregate command to assist us in specific use cases or with massive datasets that may take longer to process.

Activity 7.01: Putting Aggregations into Practice

The cinema company from previous exercises has been very impressed with the insights you've managed to extract from the data using aggregation pipelines. However, the company is having trouble managing the different queries and combining the data into meaningful results. They have decided that they would like a single, unified aggregation that summarizes the essential information for their upcoming movie marathon campaign.

You aim to design, test, and run an aggregation pipeline that will create this unified view. You should ensure that the final output of the aggregation answers the following business problems:

  • For each genre, which movie has the most award nominations, given that they have won at least one of these nominations?
  • For each of these movies, what is their total runtime, given that each movie has 12 minutes of trailers appended before it?
  • Provide an example of the sorts of things users are saying about each film.
  • Because this is a classic movie marathon, only movies released before 2001 are eligible.
  • Across all genres, list all the genres that have the highest number of award wins.

You may complete this activity in whichever way you choose, but try to focus on creating a simple and efficient aggregation pipeline that can be tweaked or modified in the future. It is sometimes best to try and decide what an output document might look like, and then work backward from there.

Remember, you may also choose to use the $sample stage to speed up your query while you're testing, but you must remove these steps in the final solution.

To keep the desired output simple, limit the result to three documents for this scenario.

The following steps will help you to complete this task:

  1. Filter out any documents that were not released before 2001.
  2. Filter out any documents that do not have at least one award win.
  3. Sort the documents by award nominations.
  4. Group the documents into a genre.
  5. Take the first film in each group.
  6. Take the total number of award wins for each group.
  7. Join with the comments collection to get a list of comments for each film.
  8. Reduce the number of comments for each film to one using projection. (Hint: use the $slice operator to reduce array length.)
  9. Append the trailer time of 12 minutes to each film's runtime.
  10. Sort our result by the total number of award wins.
  11. Impose a limit of three documents.
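As a sketch of the hint in step 8, the $slice operator inside a $project stage truncates an array to a given length. The field names below are assumptions for illustration, and the plain-JavaScript slice() call mirrors the effect:

```javascript
// Hypothetical $project stage using $slice to keep one comment per film.
// The "comments" field name is an assumption for illustration.
var projectStage = { $project: { title: 1, comments: { $slice: ["$comments", 1] } } };

// What $slice: ["$comments", 1] does to an array, in plain JavaScript:
var comments = ["Great movie!", "A classic.", "Overrated."];
var truncated = comments.slice(0, 1);

console.log(truncated.length); // 1
```

Truncating the array inside the pipeline keeps the output documents small, which matters when each film could have hundreds of comments attached after the lookup.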

The desired output is as follows:

Figure 7.23: Final output after executing activity steps


Note

The solution for this activity can be found via this link.

Summary

In this chapter, we have covered all the essential components that you need to understand, write, comprehend, and improve MongoDB aggregations. This new functionality will help you to answer more complex and difficult questions about your data. By creating multi-stage pipelines that join multiple collections, you can increase the scope of your queries to the entire database instead of a single collection. We also looked at how to write the results into a new collection to enable further exploration or manipulation of the data.

In the final section, we covered the importance of ensuring that your pipelines are written with scalability, readability, and performance in mind. By focusing on these aspects, your pipelines will continue to deliver value in the future and can act as a basis for further aggregations.

However, what we have covered here is just the beginning of what you can accomplish with the aggregation feature. It is critical that you keep exploring, experimenting, and testing your pipelines to truly master this MongoDB skill.

In the next chapter, we will walk through the creation of an application in Node.js with MongoDB as a backend. Even if you're not a developer, this will give you meaningful insight into how MongoDB applications are built, along with a deeper understanding of building and executing dynamic queries.